ImputeDB: Data Imputation as a Query Optimization

Authors

  • Jose Cambronero
  • John K. Feser
  • Micah Smith
Abstract

To study the placement of an imputation step, we create a logical imputation operator (along with its respective physical instances) and incorporate its placement into the query plan optimization process in SimpleDB. We introduce measures of information loss and runtime for imputation operations, which capture the main trade-offs in imputation placement. We add these measures to our cost estimation, allowing us to place the data imputation step intelligently during query planning. We show the trade-offs between efficiency and accuracy for simple data imputation models.
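The abstract does not say how the information-loss and runtime measures are combined in the cost estimate. As a rough illustration only, the sketch below assumes a simple weighted sum controlled by a hypothetical parameter `alpha`, and picks the cheapest of a few candidate placements. The class `PlanEstimate`, the functions `combined_cost` and `choose_plan`, the placement names, and all numbers are invented for this example and are not taken from the paper or from SimpleDB.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    """Hypothetical estimates for one candidate placement of the imputation step."""
    name: str
    runtime: float    # estimated runtime cost (arbitrary units, made up)
    info_loss: float  # estimated information loss (arbitrary units, made up)


def combined_cost(plan: PlanEstimate, alpha: float) -> float:
    # Assumed weighted sum: alpha = 0 counts only information loss,
    # alpha = 1 counts only runtime.
    return alpha * plan.runtime + (1.0 - alpha) * plan.info_loss


def choose_plan(plans: list[PlanEstimate], alpha: float) -> PlanEstimate:
    """Return the candidate with the lowest combined cost."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return min(plans, key=lambda p: combined_cost(p, alpha))


# Illustrative candidates: imputing earlier in the plan is assumed to cost
# more runtime but lose less information.
candidates = [
    PlanEstimate("impute_at_scan", runtime=100.0, info_loss=1.0),
    PlanEstimate("impute_after_filter", runtime=30.0, info_loss=4.0),
    PlanEstimate("impute_after_join", runtime=10.0, info_loss=30.0),
]
```

Calling `choose_plan(candidates, alpha)` with different `alpha` values shows how the preferred placement could shift as the weight moves from accuracy toward efficiency. A real optimizer would derive both estimates from table statistics rather than fixed constants.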


Similar resources

Query Optimization for Dynamic Imputation

Missing values are common in data analysis and present a usability challenge. Users are forced to pick between removing tuples with missing values or creating a cleaned version of their data by applying a relatively expensive imputation strategy. Our system, ImputeDB, incorporates imputation into a cost-based query optimizer, performing necessary imputations on-the-fly for each query. This allows u...


Missing data imputation in multivariable time series data

Multivariate time series data are found in a variety of fields such as bioinformatics, biology, genetics, astronomy, geography and finance. Many time series datasets contain missing data. Imputing missing data in multivariate time series is a challenging topic and needs to be carefully considered before learning from or predicting time series. Much research has been done on the use of diffe...


Improved Search Results Of Keyword Query Using Data Imputation Approach

Keyword queries over databases offer simple access to data, but they frequently suffer from low ranking quality. It would be beneficial to identify queries that are likely to have low ranking quality in order to improve user satisfaction. For example, the system may recommend alternative queries to the user for such hard queries. In this report, the characteristics of hard queries are analyze...


Relational Databases Query Optimization using Hybrid Evolutionary Algorithm

Optimizing database queries is a hard research problem. Exhaustive search techniques such as dynamic programming are suitable for queries with few relations, but as the number of relations in a query grows, they demand too much memory and processing to remain practical, so random and evolutionary methods must be used instead. The use of evolutionary methods, beca...


Accuracy evaluation of different statistical and geostatistical censored data imputation approaches (Case study: Sari Gunay gold deposit)

Most geochemical datasets include missing data in varying proportions, which may cause a significant problem in geostatistical modeling or multivariate analysis of the data. Therefore, it is common to impute the missing data in most geochemical studies. In this study, three approaches called half detection (HD), multiple imputation (MI), and the cosimulation based on Markov model 2...




Publication date: 2017